- Cluster analysis (kmeans and hierarchical)
- Reading raster files
- Principal component analysis
Cluster analysis assignes each data point to one of k clusters, thereby maximmizing the between-cluster variance and minimizing the within-cluster variance.
There are two principal types of cluster algorithms: partionating algorithms (e.g. kmeans) and hierarchical algorithms.
A kmeans cluster analysis can be performed using the kmeans() function:
data(iris) iris_kmeans <- kmeans(x = iris[, c(1:4)], centers = 3)
## K-means clustering with 3 clusters of sizes 38, 62, 50 ## ## Cluster means: ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 6.850000 3.073684 5.742105 2.071053 ## 2 5.901613 2.748387 4.393548 1.433871 ## 3 5.006000 3.428000 1.462000 0.246000 ## ## Clustering vector: ## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [71] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 ## [106] 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 ## [141] 1 1 2 1 1 1 2 1 1 2 ## ## Within cluster sum of squares by cluster: ## [1] 23.87947 39.82097 15.15100 ## (between_SS / total_SS = 88.4 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" ## [5] "tot.withinss" "betweenss" "size" "iter" ## [9] "ifault"
We can plot the cluster solution using ggplot:
iris$cluster3 <- iris_kmeans$cluster p <- ggplot(iris, aes(Sepal.Length, y = Sepal.Width, col = factor(cluster3))) + geom_point()
And compare it to the acutal species classification:
table(iris$cluster3, iris$Species)
## ## setosa versicolor virginica ## 1 0 2 36 ## 2 0 48 14 ## 3 50 0 0
More/less clusters lead to different solutions:
iris$cluster2 <- kmeans(x = iris[, c(1:4)], centers = 2)$cluster iris$cluster4 <- kmeans(x = iris[, c(1:4)], centers = 4)$cluster iris$cluster5 <- kmeans(x = iris[, c(1:4)], centers = 5)$cluster iris$cluster6 <- kmeans(x = iris[, c(1:4)], centers = 6)$cluster p <- iris %>% gather(key = k, value = cluster, -(Sepal.Length:Species)) %>% ggplot(., aes(Sepal.Length, y = Sepal.Width, col = factor(cluster))) + geom_point() + facet_wrap(~k, ncol = 6)
Hierarchical clustering can be performed using the hclust() function. However, we first need to calcuate the distances (similarities) between each data point pair:
dist <- dist(iris[, 1:4])
There are several methods for calculating the distance (e.g., euclidean, manhatten, …), see ?dist(). With the distance matrix we can hierarchically cluster all data points:
iris_hc <- hclust(dist)
And plot i using the ggdrndro package:
library(ggdendro) p <- ggdendrogram(iris_hc)
See https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html for further information.
For todays exercise we are going to use data that is stored in a raster stack, that is a stack of gridded rasters with matching spatial resolution and extent.
Reading a raster stack can be done using the stack() function in the raster package:
library(raster)
clim <- stack("Data/climate.envi")
Raster stacks can be of any format (geoTiff, ENVI, grid, bsq, etc.) and with or without spatial reference. For reading a single raster instead of a stack, you can use the raster() function.
clim
## class : RasterStack ## dimensions : 709, 1306, 925954, 4 (nrow, ncol, ncell, nlayers) ## resolution : 0.008333333, 0.008333333 (x, y) ## extent : 16.90833, 27.79167, 44.33333, 50.24167 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 ## names : Band.1, Band.2, Band.3, Band.4 ## min values : 0.08333333, -52.83333333, -26.41666667, 38.41666667 ## max values : 173.2500, 77.2500, 124.6667, 146.6667
names(clim) <- c("Tmin", "Tmean", "Tmax", "Precip")
Plotting the raster is easily achieved by the plot() command. However, you have to select a layer for plotting:
plot(clim, y = 1)
Rasters can have millions of pixels, which can make computations very slow. To deal with this issue, we are going to sample 10,000 pixels from the raster:
clim_samp <- sampleRandom(clim, size = 10000) clim_samp <- as.data.frame(clim_samp) # sampleRandom returns a matrix clim_samp[1:5, ]
## Tmin Tmean Tmax Precip ## 1 158.8333 54.75000 106.50000 49.25000 ## 2 119.9167 38.41667 79.00000 54.33333 ## 3 131.6667 36.41667 83.75000 57.91667 ## 4 141.5833 50.58333 95.91666 57.33333 ## 5 104.9167 21.25000 62.91667 61.75000
library(GGally) p <- clim_samp %>% sample_n(., size = 100) %>% ggpairs(.)
There are two functions in R for performaing a PCA: princomp() and prcomp(). While the former function calculates the eigenvalue on the correlation/covariance matrix using the function eigen(), the latter function uses singular value decomposition, which numerically more stable. We here are going to use prcomp().
pca <- prcomp(clim_samp, scale. = TRUE)
You should set the scale. argument to TRUE, to scale the data matrix to unit variances. The default is scale. = FALSE. You can also use scale() to standardize the original data matrix and then enter the standardized data matrix to prcomp.
The prcomp object stores the rotations (eigenvectors) which define the contribution of each variable to a component:
pca$rotation
## PC1 PC2 PC3 PC4 ## Tmin -0.5277307 -0.2333982 0.679557980 -4.530193e-01 ## Tmean -0.5274517 -0.2265064 -0.732016319 -3.669355e-01 ## Tmax -0.5324369 -0.2324769 0.048308505 8.124849e-01 ## Precip 0.3997574 -0.9166104 -0.004398646 -4.017791e-05
The eigenvalues are also stored in the prcomp object:
pca$sdev
## [1] 1.850517965 0.734087442 0.191547478 0.002907044
And can be easily translated into variance explained:
var_exp <- data.frame(pc = 1:4,
var_exp = pca$sdev^2 / sum(pca$sdev^2))
var_exp$var_exp_cumsum <- cumsum(var_exp$var_exp)
p <- ggplot(var_exp) +
geom_bar(aes(x = pc, y = var_exp), stat = "identity") +
geom_line(aes(x = pc, y = var_exp_cumsum))
We can now apply the PCA transformation to the whole raster stack:
clim_pca <- raster::predict(clim, pca, index = 1:4)
Let's combine PCA and kmeans:
pca_values <- as.data.frame(clim_pca) pca_kmeans <- kmeans(pca_values[, 1:3], centers = 5) kmean_raster <- raster(clim_pca) values(kmean_raster) <- pca_kmeans$cluster plot(kmean_raster)